Analyzing the Indian Premier League

a tutorial by Hiteesh Nukalapati

Introduction

What is the Indian Premier League (IPL)?
It is the world's most popular, most watched, most sought-after cricket league.

What is Cricket?
Cricket is a team sport played between two teams of 11 players, made up of batters and bowlers. It can be played in a stadium of any shape and size as long as the center has a 22-yard rectangular strip called the pitch. There are two innings (phases) in a match, and the team that wins the toss gets to pick what they want to do first (i.e. bat first or bowl first). The bowler bowls to the batter; the batter scores runs by hitting the ball, whereas the bowler aims to take wickets (i.e. get the batter out). There are multiple formats of the game, but the IPL follows the T20 format, short for Twenty20, which simply means that each team gets to bowl 20 overs (each over consists of 6 balls, which means every innings has a maximum of 120 balls). A team can win the match in two main ways:

This is all the information you will need to understand and follow the tutorial, but I encourage you to learn more about the game at https://en.wikipedia.org/wiki/Cricket. It is a beautiful game!

So why did I choose the IPL to create a tutorial?
It simply reminds me of India. Cricket is a big sport in India, and I care about it deeply. This season's IPL was postponed indefinitely due to the raging COVID-19 pandemic that has taken the country by storm. I planned on going to India this summer, and it looks like that is not happening. So I thought: what if I try to better understand the game? Maybe along the way I will find some interesting facts about it, and I will also miss the game a little less (win-win!).

Set up

Locating and Loading the data

After a bit of googling, I found a dataset on Kaggle that contains ball-by-ball data for IPL matches between 2008 and 2016: https://www.kaggle.com/manasgarg/ipl
This dataset consists of two CSV files (CSV, short for comma-separated values, is a common format for sharing data):


I will load both of these files into two separate pandas DataFrames. pandas provides a built-in function called read_csv(). This function takes the path to a file and converts its contents into a DataFrame (a 2D data structure used to store data, much like a SQL table).
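As a rough sketch of the loading step: read_csv() accepts either a file path or a file-like object, so the example below feeds it a few made-up rows from memory instead of the real files. The column names other than umpire3, team1 and match_id are assumptions about the dataset; with the downloaded data you would pass the paths of the two CSV files instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two CSV files (rows are made up for illustration).
matches_csv = io.StringIO(
    "id,season,team1,team2,winner,umpire3\n"
    "1,2008,Mumbai Indians,Chennai Super Kings,Mumbai Indians,\n"
    "2,2008,Kolkata Knight Riders,Mumbai Indians,Kolkata Knight Riders,\n"
)
deliveries_csv = io.StringIO(
    "match_id,inning,over,ball,total_runs,player_dismissed\n"
    "1,1,1,1,4,\n"
    "1,1,1,2,0,SR Tendulkar\n"
)

# read_csv turns each CSV into a DataFrame; empty fields become NaN.
matches = pd.read_csv(matches_csv)
deliveries = pd.read_csv(deliveries_csv)

print(matches.shape)
print(deliveries.head())
```

Calling head() or looking at shape right after loading is a quick way to confirm that both tables were read as expected.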

Tidying the data

When we look at the data, we realize that in the matches table (DataFrame), the umpire3 column is empty, filled with NaNs. So we can go ahead and drop that column from our table.
When we look at the deliveries table, we see that a few entries in some columns are NaNs.
We must not drop these columns (because they indicate something important when they actually contain a value), but we must make them usable. To do this, I will replace all NaN values in the deliveries table with 0s.
This works well here because we are not actually missing data; instead, NaNs are used to indicate that a specific event did not occur. For this data, this is the best way forward, but it will not always be the best choice: it depends on the type of data and the analysis we want to perform.
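The two tidying steps above can be sketched as follows. The frames are small made-up stand-ins, and player_dismissed is an assumed column name for one of the sparsely filled columns; drop() and fillna() are the standard pandas calls for this.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two tables.
matches = pd.DataFrame({
    "id": [1, 2],
    "team1": ["Mumbai Indians", "Chennai Super Kings"],
    "umpire3": [np.nan, np.nan],  # entirely empty column
})
deliveries = pd.DataFrame({
    "match_id": [1, 1, 1],
    "total_runs": [1, 0, 4],
    # Assumed column: NaN here means "nobody was dismissed on this ball".
    # dtype=object keeps the column able to hold both names and the filler 0.
    "player_dismissed": pd.Series([np.nan, "SR Tendulkar", np.nan], dtype=object),
})

# Drop the column that carries no information.
matches = matches.drop(columns=["umpire3"])

# Replace "event did not occur" NaNs with 0.
deliveries = deliveries.fillna(0)

print(matches.columns.tolist())
print(deliveries)
```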
For further simplicity, I am going to abbreviate team names so that they are easier to type, display and use. We got all the unique teams in the data by running matches['team1'].unique().
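One way to do the abbreviation is with a dictionary and replace(). This is only a sketch: the mapping below covers three teams, the rows are invented, and team2 and winner are assumed to be the other columns that hold team names.

```python
import pandas as pd

# Made-up rows; team2 and winner are assumed column names.
matches = pd.DataFrame({
    "team1": ["Mumbai Indians", "Gujarat Lions", "Mumbai Indians"],
    "team2": ["Gujarat Lions", "Kochi Tuskers Kerala", "Kochi Tuskers Kerala"],
    "winner": ["Mumbai Indians", "Gujarat Lions", "Kochi Tuskers Kerala"],
})

# List the distinct team names so we know what needs a short form.
print(matches["team1"].unique())

# Illustrative mapping from full name to abbreviation.
team_abbrev = {
    "Mumbai Indians": "MI",
    "Gujarat Lions": "GL",
    "Kochi Tuskers Kerala": "KTK",
}

# Apply the mapping to every column that stores a team name.
team_cols = ["team1", "team2", "winner"]
matches[team_cols] = matches[team_cols].replace(team_abbrev)

print(matches)
```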

Basic Analysis

We've used a few pandas library functions to perform some basic analysis.
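A hedged sketch of the kind of calls involved, on invented rows: nunique() counts distinct grounds and value_counts() tallies awards per player. The column names venue and player_of_match are assumptions about the matches table.

```python
import pandas as pd

# Made-up rows; venue and player_of_match are assumed column names.
matches = pd.DataFrame({
    "venue": ["Wankhede Stadium", "Eden Gardens", "Wankhede Stadium", "Newlands"],
    "player_of_match": ["CH Gayle", "CH Gayle", "MS Dhoni", "CH Gayle"],
})

# Number of distinct grounds the tournament has been played at.
n_venues = matches["venue"].nunique()

# Awards per player, sorted from most to fewest.
mom_counts = matches["player_of_match"].value_counts()
top_player = mom_counts.idxmax()

print(n_venues)
print(top_player, mom_counts[top_player])
```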
As we can see, over the years the tournament has been played at 30 locations (grounds, to be more precise). Most of these locations are in India, but if we look closely, some of them are in the UK, UAE and South Africa!
The MVP, with the most Man of the Match awards, is none other than Chris Gayle! (If you follow cricket closely, you'll know what an impact the player has on the team and the game.) Let's look at every team's success so far:

As we can see, MI (Mumbai Indians) is the most successful IPL team; in fact, they have won the tournament 5 times now!
Visualization is an important aspect of data analysis because it lets us put things in perspective. Here, we used the matplotlib and seaborn libraries to plot the above graph. We created a bar plot using the seaborn barplot function. We always have to pass data to a visualization function, and passing the right data is very important. So, above, we created a new DataFrame called team_wins_df that holds exactly the data we need to plot the graph.
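A minimal sketch of how a frame like team_wins_df could be built and handed to seaborn's barplot. The winner column name, the team and wins column labels, and the rows are assumptions for illustration, not the original notebook's code.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up results; winner is an assumed column name.
matches = pd.DataFrame({"winner": ["MI", "CSK", "MI", "KKR", "MI", "CSK"]})

# One row per team with its number of wins: exactly what the plot needs.
team_wins_df = (
    matches["winner"]
    .value_counts()
    .rename_axis("team")
    .reset_index(name="wins")
)

# Hand the prepared frame to seaborn and label the chart.
ax = sns.barplot(data=team_wins_df, x="team", y="wins")
ax.set_title("Total Victories of IPL Teams")
plt.tight_layout()

print(team_wins_df)
```

Preparing a small, purpose-built frame first keeps the plotting call itself to a single line.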

Further Analysis and Visualization

In the game of cricket, the toss is a very important factor. It gives the team that wins it an edge, because they get their preferred decision (to either bowl first or bat first). This decision is made by the captain and coach after taking into consideration factors like ground size, dew, the opposition, the time of day, etc.
Let's take a closer look at the toss, toss decisions and the way they affect the game.

We use matches.shape[0] to get the total number of matches, and we use matches['toss_decision'].value_counts() to count the occurrences of each decision.
As we can see, 57% of the teams winning the toss decide to field (bowl) first, and the rest choose to bat first.
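The percentage calculation can be sketched like this, using the two calls named above on a handful of invented rows (so the numbers differ from the real 57%):

```python
import pandas as pd

# Made-up toss decisions for five matches.
matches = pd.DataFrame({
    "toss_decision": ["field", "bat", "field", "field", "bat"],
})

# Total number of matches is the number of rows.
total_matches = matches.shape[0]

# How often each decision was made.
decision_counts = matches["toss_decision"].value_counts()

# Convert counts to percentages of all matches.
decision_pct = (decision_counts / total_matches * 100).round(1)

print(decision_pct)
```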

Let's look at the toss decisions across seasons:

2016 was the season with the largest divide between teams choosing to field first and those choosing to bat first, while 2012 was the season with the smallest divide.
Unsurprisingly, there is a lot of variation across years in how teams decide at the toss. This may be because of where the teams were playing: some conditions call for fielding first while others call for batting first.
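A per-season breakdown like this can be sketched with a cross-tabulation. The rows are invented and season is an assumed column name; the "divide" is simply taken here to mean the absolute difference between the two counts.

```python
import pandas as pd

# Made-up rows; season is an assumed column name.
matches = pd.DataFrame({
    "season": [2012, 2012, 2012, 2012, 2016, 2016, 2016, 2016],
    "toss_decision": ["bat", "field", "bat", "field",
                      "field", "field", "field", "bat"],
})

# Rows are seasons, columns are the two possible decisions.
by_season = pd.crosstab(matches["season"], matches["toss_decision"])

# Gap between field-first and bat-first choices in each season.
by_season["divide"] = (by_season["field"] - by_season["bat"]).abs()

print(by_season)
print(by_season["divide"].idxmax())
```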
Let's see where the 2012 and 2017 seasons were played.

It turns out that both the 2012 season and the 2017 season were played in India! So it might not be the location after all. It could just be a shift in strategy by coaches and teams.
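A lookup of this kind could be sketched by filtering on the season and listing the distinct host cities. The rows are invented, and season and city are assumed column names.

```python
import pandas as pd

# Made-up rows; season and city are assumed column names.
matches = pd.DataFrame({
    "season": [2012, 2012, 2014, 2017],
    "city": ["Mumbai", "Chennai", "Abu Dhabi", "Hyderabad"],
})

# Keep only the two seasons of interest.
selected = matches[matches["season"].isin([2012, 2017])]

# Distinct host cities for each of those seasons.
cities_by_season = selected.groupby("season")["city"].unique()

print(cities_by_season)
```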

Let's look at the teams that won the most tosses:

As we can see, MI has won the most tosses! This bar plot is also very similar to the bar plot above labelled "Total Victories of IPL Teams": most of the teams that are successful have also won a lot of tosses. So the toss can be an important factor in cricket! Please note, however, that this chart in no way indicates that MI and the other teams with many toss wins have a higher chance of winning any given toss. Likewise, the teams at the lower end (GL, RPS, KTK) do not have a worse chance of winning the toss; they simply did not play enough games. This is a very important observation.
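To make the caveat about games played concrete, here is a rough sketch that divides each team's toss wins by the number of matches it played. The rows are invented, and toss_winner and team2 are assumed column names; this normalization is an illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up rows; toss_winner and team2 are assumed column names.
matches = pd.DataFrame({
    "team1": ["MI", "MI", "MI", "GL"],
    "team2": ["CSK", "GL", "CSK", "CSK"],
    "toss_winner": ["MI", "MI", "CSK", "GL"],
})

# Raw toss wins: this is what the bar plot shows.
toss_wins = matches["toss_winner"].value_counts()

# Every team appears once per match, either as team1 or as team2.
matches_played = pd.concat([matches["team1"], matches["team2"]]).value_counts()

# Put both counts side by side and compute the share of tosses won.
toss_summary = pd.DataFrame({
    "toss_wins": toss_wins,
    "matches_played": matches_played,
}).fillna(0)
toss_summary["toss_win_rate"] = (
    toss_summary["toss_wins"] / toss_summary["matches_played"]
)

print(toss_summary)
```

A team with few toss wins can still have a perfectly ordinary win rate once the number of matches is taken into account.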

Let's see if the toss winner is also the match winner

Important fact: the probability of winning the toss is equal for both teams, as a two-sided coin is flipped at the toss.
From the above pie chart, we see that winning the toss does not necessarily mean winning the match!
At this point, it is also nice to appreciate the fact that we have all these built-in library functions that make our lives easy. All we have to do is filter the data into a format that suits the visualization we choose.
Above, we used the matplotlib pie method to create the pie chart.
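A small sketch of such a pie chart, on invented rows: compare the toss winner with the match winner, count the two outcomes, and pass the counts to matplotlib's pie method. The toss_winner and winner column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; toss_winner and winner are assumed column names.
matches = pd.DataFrame({
    "toss_winner": ["MI", "CSK", "KKR", "MI"],
    "winner": ["MI", "MI", "KKR", "CSK"],
})

# True where the team that won the toss also won the match.
toss_winner_won = matches["toss_winner"] == matches["winner"]
won = int(toss_winner_won.sum())
lost = len(matches) - won

# Draw the two slices with percentage labels.
fig, ax = plt.subplots()
ax.pie(
    [won, lost],
    labels=["Toss winner won", "Toss winner lost"],
    autopct="%1.1f%%",
)
ax.set_title("Is the toss winner also the match winner?")

print(won, lost)
```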

What about finals?

Let's see how the toss and other factors impact the tournament decider!

WOW! 83% of teams that win the toss in the decider win the match!
This could just be because the team winning the toss is under less pressure, but there could definitely be other factors that are not visible in this data. The toss is definitely an important factor when it comes to the final.
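One possible way to isolate the finals, sketched on invented rows: assume the final is the last match of each season by date, then check how often the toss winner also won. The assumption about how to find the final and the season, date, toss_winner and winner column names are mine, not confirmed by the dataset.

```python
import pandas as pd

# Made-up rows; all column names here are assumptions.
matches = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009],
    "date": ["2008-04-18", "2008-06-01", "2009-04-18", "2009-05-24"],
    "toss_winner": ["Team A", "Team B", "Team C", "Team D"],
    "winner": ["Team A", "Team B", "Team C", "Team E"],
})
matches["date"] = pd.to_datetime(matches["date"])

# Assumption: the final is the last match played in each season.
finals = matches.sort_values("date").groupby("season").tail(1)

# Share of finals won by the team that won the toss.
toss_winner_won_final = finals["toss_winner"] == finals["winner"]
pct_toss_winner_won = toss_winner_won_final.mean() * 100

print(finals[["season", "toss_winner", "winner"]])
print(pct_toss_winner_won)
```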

Let's see how the decision made after winning the toss affects the outcome of the game:

Above, True=Winning/Won and False=Losing/Lost.
It looks like the team that wins the toss and chooses to bat first has won the final the most times. Captains and coaches should definitely take this into consideration while making a decision in the final!
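A breakdown of this kind can be sketched with a groupby over the toss decision. The finals frame below is invented and the column names are assumptions; True means the toss winner went on to win the final.

```python
import pandas as pd

# Made-up finals; column names are assumptions.
finals = pd.DataFrame({
    "toss_decision": ["bat", "bat", "field", "field", "bat"],
    "toss_winner": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "winner": ["Team A", "Team B", "Team X", "Team D", "Team Y"],
})

# True = the toss winner won the final, False = the toss winner lost.
finals["toss_winner_won"] = finals["toss_winner"] == finals["winner"]

# Finals won by the toss winner, split by what they chose to do.
wins_by_decision = finals.groupby("toss_decision")["toss_winner_won"].sum()

# Same split expressed as a win rate per decision.
win_rate_by_decision = finals.groupby("toss_decision")["toss_winner_won"].mean()

print(wins_by_decision)
print(win_rate_by_decision)
```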

Now, let's look at a few interesting and important stats.

I will explain what the significance of each statistic is and how it affects the team.

Runs across seasons.

Batters are responsible for the runs each team scores. The more runs the team batting first scores, the harder it is for the team batting second to win the match; a big total also gives the team bowling second more leeway to get the opposing batters out. So, in simple terms, the more runs a team scores, the better chance it has of winning the match. This is an indicator of how competitive teams are.
To do this, we create batters_df to store only the batter data we need. The data we need exists in both the matches table and the deliveries table. We use the match_id column to our advantage here (as it is unique for every match) to merge the two tables into seasons_df using a left join. The generated graph shows that, in general, teams (all together) have increased the number of runs scored across seasons. Taken in isolation, this indicates that the tournament was the most competitive in the 2013 season. There is a sharp dip from there on in terms of the runs being scored.
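The merge described above might look roughly like this. The rows are invented, and I am assuming that the match key is called id in the matches table and that total_runs is the runs column; the original notebook may have used different columns.

```python
import pandas as pd

# Made-up rows; id, season and total_runs are assumed column names.
matches = pd.DataFrame({"id": [1, 2, 3], "season": [2008, 2008, 2009]})
deliveries = pd.DataFrame({
    "match_id": [1, 1, 2, 3, 3],
    "total_runs": [4, 2, 6, 2, 1],
})

# Keep only the columns needed for this statistic.
batters_df = deliveries[["match_id", "total_runs"]]

# Left join: every delivery keeps its row and gains the season of its match.
seasons_df = batters_df.merge(
    matches[["id", "season"]],
    how="left",
    left_on="match_id",
    right_on="id",
)

# Total runs scored in each season.
runs_per_season = seasons_df.groupby("season")["total_runs"].sum()

print(runs_per_season)
```

A left join from the deliveries side guarantees that no ball is dropped, even if a match were missing from the matches table.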

Runs per match across seasons.

This is the same statistic as above, except calculated per match. This gives a more granular look at how the tournament has progressed across seasons.
We do this in the same way as above, by creating a new DataFrame with only the data we need.
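As a sketch under the same assumptions (invented rows, assumed column names), the per-match figure is the season's total runs divided by the number of distinct matches in that season:

```python
import pandas as pd

# Made-up merged frame: one row per delivery with its season attached.
seasons_df = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009],
    "match_id": [1, 1, 2, 3, 3],
    "total_runs": [4, 2, 6, 2, 1],
})

# Per season: total runs and the number of distinct matches played.
summary = seasons_df.groupby("season").agg(
    total_runs=("total_runs", "sum"),
    n_matches=("match_id", "nunique"),
)

# Average runs scored per match in each season.
summary["runs_per_match"] = summary["total_runs"] / summary["n_matches"]

print(summary)
```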
The graph genr

Total Matches vs Total Wins for each team